Abstract

Genetic linkage maps are indispensable tools in genetic, genomic and breeding studies. As one of genotyping-by- 
sequencing methods, RAD-Seq (restriction-site associated DNA sequencing) has gained particular popularity for 
construction of high-density linkage maps. Current RAD analytical tools are being predominantly used for typing 
codominant markers. However, no genotyping algorithm has been developed for dominant markers (resulting from 
recognition site disruption). Given their abundance in eukaryotic genomes, utilization of dominant markers would greatly 
diminish the extensive sequencing effort required for large-scale marker development. In this study, we established, for the 
first time, a novel statistical framework for de novo dominant genotyping in mapping populations. An integrated package 
called RADtyping was developed by incorporating both de novo codominant and dominant genotyping algorithms. We 
demonstrated the superb performance of RADtyping in achieving remarkably high genotyping accuracy based on 
simulated and real mapping datasets. The RADtyping package is freely available at
http://www2.ouc.edu.cn/mollusk/detailen.asp?id=727.
Introduction

Genetic linkage maps are indispensable tools in genetic, 
genomic and breeding studies. A high-resolution linkage map is 
exceptionally valuable in many applications such as fine-scale 
quantitative trait locus (QTL) mapping, characterization of 
recombination hotspots, comparative genome analysis, genome 
scaffolding and marker-assisted selection. The advent of next- 
generation sequencing technologies has greatly stimulated the 
development of a variety of genotyping by sequencing methods 
that enable simultaneously discovering and genotyping of 
thousands of single nucleotide polymorphisms (SNPs). In 
particular, RAD (restriction-site associated DNA) has gained 
popularity for linkage map construction [1], and several 
methods with simpler library preparation protocols have been 
developed, such as 2b-RAD [2] and ddRAD [3]. With 
increasing demands for application of the RAD method in 
poorly-studied organisms, several tools such as Stacks [4], 
RApiD [5], RADtools [6] and iML [7] have been developed to
analyze RAD data de novo (i.e., in the absence of a reference
genome). However, these tools are predominantly used
for scoring codominant markers. For dominant markers,
which are scored as "presence" or "absence" due to the
disruption of recognition sites, available tools only
output the raw count of tag presence or absence. For these
tools, the accuracy of dominant genotype calls remains
unclear. No experimental validation has been performed to
determine what percentage of the observed tag presence/absence
polymorphism is truly due to restriction site
heterozygosity rather than variation in sequencing depth. A
statistical framework for de novo dominant genotyping remains
to be established. It has been shown that dominant markers
can provide a large amount of additional genotypic information
(e.g., accounting for ~40% of total markers in the
threespine stickleback; [8]), the utilization of which would
greatly diminish the extensive sequencing effort required for
large-scale marker development. The implications of dominant
marker variation have been explored in several recent
studies [9-11]. In the present study, we established, for the
first time, a novel statistical framework for de novo dominant
genotyping in linkage mapping studies. An integrated
package called RADtyping was developed, which could
achieve accurate de novo codominant and dominant genotyping
in mapping populations. The performance of RADtyping
was thoroughly evaluated using both simulated and real
mapping datasets.
Figure 1. An overview of the RADtyping approach for de novo codominant and dominant genotyping in a mapping population.

Representative reference sites are obtained by assembling parental sequencing reads into "locus" clusters. These sites are further classified into
parent-shared and parent-specific sites for subsequent codominant and dominant genotyping. Main principles of codominant and dominant
genotyping algorithms are shown in flowcharts, and more details are described in the Methods section.
doi:10.1371/journal.pone.0079960.g001



Results and Discussion 

Overview of the RADtyping methodology 

The principle of RADtyping is outlined below (also shown in 
Figure 1) and a full description of the genotyping algorithms is 
available in the Methods section. 

Representative reference reconstruction. Reference sites 
are reconstructed using sequencing data from both mapping 
parents. Briefly, all pre-processed reads from two mapping parents 
are combined and assembled into exactly matching read clusters 
(i.e. representing individual alleles), and then these "allele" clusters 
are further merged into "locus" clusters by allowing certain 
mismatches. A collection of consensus sequences from all the 
"locus" clusters comprises the representative reference sites. These 
sites are further classified into parent-shared and parent-specific 
sites for subsequent codominant and dominant genotyping, 
respectively. 

Codominant genotyping. To obtain high-quality reference 
sites, parent-shared reference sites are first filtered by excluding 
sites that are either not supported by parental reads in sufficient
depth or derived from repetitive genomic regions. Here the iML
algorithm recently developed by our group is adopted to exclude 
repetitive sites from genotyping, the performance of which has 
been thoroughly evaluated [7]. Once the high-quality reference 
sites are determined, sequencing reads from the two parents and 
their progeny are separately mapped to those sites. For each locus, 
posterior probabilities are calculated for two possible genotypes 
(i.e., homozygote or heterozygote) and then a likelihood ratio test 
is performed to determine the most likely genotype. 

Dominant genotyping. Unlike codominant markers, dominant
markers are scored as "presence" or "absence" to reflect
whether a recognition site is intact or disrupted. Similar to
codominant genotyping, parent-specific reference sites that are not
supported by parental reads in sufficient depth are first filtered out.
In addition, reference sites that are not sequenced to sufficient depth
in the progeny are also excluded to avoid incorrect "absence" calls
from these low-coverage sites. Sequencing reads are then mapped to
the high-quality reference sites obtained, and the "absence" or
"presence" of each site is determined using the depth threshold l_d to
prevent incorrect "presence" calls from sites with misaligned reads.
RADtyping performance on simulation data

The performance of RADtyping was first evaluated using in
silico sequencing datasets generated from a pseudo Arabidopsis F1
mapping population (see Methods for details). The aims of our
simulation analysis were (i) to evaluate the performance of
RADtyping in three key aspects (i.e., genotype coverage, removal
of repetitive sites and genotyping accuracy), and (ii) to help devise
a cost-effective sequencing strategy for linkage mapping studies by
balancing sequencing cost and genotyping accuracy. The simulation
results showed that with increased sequencing depth for
parents and their progeny, the percentage of ungenotyped loci
rapidly decreased and reached a "stable" level at the sequencing
depth combination of ≥20x for parents and ≥15x for progeny,
where a majority of target loci (>93% for codominant loci and
>96% for dominant loci) could be readily genotyped (Figure 2a,d,
Table S1a,d). The high-quality reference sites reconstructed for
genotyping were almost exclusively derived from unique genomic
regions (e.g., >98% at a sequencing depth of >10x for both
parents and progeny; Figure 2b,e, Table S1b,e), suggesting that
repetitive sites could be efficiently filtered out by our genotyping
algorithms. The rate of genotyping error gradually decreased with
increasing sequencing depth. For codominant genotyping,
genotyping accuracy could reach ~97% at a sequencing depth
of 20x for both parents and progeny (Figure 2c, Table S1c), while
for dominant genotyping, ~98% could be achieved at a much
lower sequencing depth (10x; Figure 2f, Table S1f), suggesting
that dominant loci could be more reliably genotyped than
codominant loci when the average sequencing depth was low. In
addition, our simulation results suggest that a minimal sequencing
depth of 20x for both parents and progeny should meet the
desired level of genotyping accuracy in large-scale linkage
mapping studies.

RADtyping performance on real data 

The performance of RADtyping was further evaluated using a
sequencing dataset generated from an F1 mapping population of
Zhikong scallop, Chlamys farreri [12]. The sequencing depth for
progeny ranged from 13.4x to 23.8x with an average of 16.75x,
whereas the parents were sequenced to a much greater depth
(70~80x). Clustering parental reads resulted in 181,625 representative
reference sites. After a series of quality-filtering steps,
117,113 parent-shared and 35,799 parent-specific sites composed
the list of high-quality reference sites. These reference sites
contained 92% of the unique sites inferred from a preliminary
reference genome we recently generated for C. farreri (870 Mb,
equivalent to ~70% genome coverage; available at
http://ipl.ouc.edu.cn/fuxiaoteng/cf_SRA_data; [12]), suggesting that unique
sites were well represented in the obtained high-quality reference
sites. In total, 7,458 polymorphic markers were identified (Table 1),
of which 6,842 that were heterozygous in at least one parent were
suitable for linkage analysis, including 2,196 codominant and
4,646 dominant markers. The excess of dominant over
codominant markers is probably related to the low sequencing
coverage of the progeny: RAD sites with low read depth are more
likely to be genotyped by the dominant algorithm than by the codominant
algorithm. For codominant RAD sites, when we count all sites that
have read coverage in at least 80% of progeny (regardless of their
genotyping status), the number of codominant markers increases to
8,679, representing 1.7 times the number of dominant markers
(5,251). Genotyping accuracy was further evaluated by amplicon
(Sanger) sequencing of eight codominant and eight dominant
markers for two parents and four progeny. The average validation
rate was 96% and 97% for the codominant and dominant
markers, respectively (Table 2). In particular, all 2b-RAD genotypes
in parents could be validated by the Sanger method,
suggesting that genotyping accuracy can be substantially improved
through deep sequencing (~50x). For the validated dominant
markers, SNPs that disrupted the recognition sites were also
confirmed (Table 3).

Currently, it remains difficult to evaluate the accuracy of RAD
genotyping tools at a large scale due to the lack of a gold standard
RAD mapping dataset with known true genotypes (especially
for dominant markers). To circumvent this problem, we generated
a mapping dataset by 2b-RAD sequencing of replicate libraries
that were independently prepared from two scallop parents
(Argopecten irradians irradians and Argopecten purpuratus) and ten of
their F1 hybrid progeny. Genotyping consistency
between these replicate datasets provides a good proxy
for the overall genotyping accuracy of RADtyping. In total, 5,533
mappable markers were identified by requiring that each marker be genotyped in
both datasets for at least 80% of progeny, including 1,561
codominant markers and 3,972 dominant markers (present in one
parent and absent in the other) in accordance with Mendelian
segregation. Very high genotyping consistency was revealed
between the two replicate datasets, with on average 96% for
codominant markers (Table 4) and 99% for dominant markers
(Table 5), which further substantiates the superb performance of
RADtyping in achieving accurate de novo codominant and
dominant genotyping in mapping populations. The finding of
slightly higher consistency for dominant markers than codominant
markers agrees with our simulation results, i.e.,
dominant loci can be more reliably genotyped than codominant
loci at the same sequencing depth. Note that heterozygous loci showed
relatively lower genotyping consistency in progeny than in parents
(Table 4), which is most likely related to the difference in average
sequencing depth between parents (181-235x) and progeny (22-46x).
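
As a quick arithmetic check, the agreement percentages can be recomputed from the replicate counts. The sketch below is illustrative only: the function name `agreement_rate` is ours and is not part of RADtyping, and the counts are the progeny values for dominant markers listed in Table 5.

```python
def agreement_rate(same, different):
    """Percentage of genotype calls that are identical between two replicates."""
    total = same + different
    if total == 0:
        raise ValueError("no genotype calls to compare")
    return 100.0 * same / total


# Progeny counts for dominant markers, as listed in Table 5.
absent = agreement_rate(same=14133, different=112)
present = agreement_rate(same=12915, different=316)
print(f"absent: {absent:.1f}%  present: {present:.1f}%")
# prints "absent: 99.2%  present: 97.6%"
```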

Future directions for RADtyping improvement 

In the present study, the performance of RADtyping was
evaluated only on 2b-RAD datasets. Although we expect
RADtyping to be generally applicable to various kinds of
RAD data, this remains to be tested. Our de novo genotyping
algorithms currently assume that RAD data approximately follow
a mixed Poisson (or normal) distribution. However, this assumption
may not be appropriate for all kinds of RAD data [9];
therefore, incorporating alternative distribution models (e.g.,
negative binomial) may further improve the
utility of this program.

Currently, RADtyping only deals with dominant markers
showing a 1:1 segregation pattern in progeny, i.e., parental
genotypes are A- for one parent and - for another, where -
represents an unsequenceable allele resulting from a mutation in
the restriction site. For dominant markers showing a 1:2:1
segregation pattern (i.e., A- x A-), a statistical genotyping approach
still needs to be established. The most challenging foreseeable step
is to accurately distinguish AA from A-, especially in cases where
deep sequencing is not feasible.

In conclusion, RADtyping enables accurate de novo genotyping
of codominant and dominant markers in mapping populations,
which would greatly facilitate construction of high-resolution
linkage maps in organisms lacking extensive genomic resources.

Materials and Methods 

Simulated and real sequencing data 

For simulation analysis, a pseudo F1 mapping population
composed of 100 progeny was created in silico for the model plant
Figure 2. Evaluation of the performance of RADtyping using a pseudo F1 mapping population. The simulated population was created by
crossing two Arabidopsis plants with predefined SNPs in their genomes, and progeny were subject to in silico sequencing together with their
parents at different sequencing depths with sequencing errors enabled. De novo codominant and dominant genotyping was evaluated in three key
aspects: genotype coverage (a, d), removal of repetitive sites (b, e), and genotyping accuracy (c, f).
doi:10.1371/journal.pone.0079960.g002



Table 1. Summary of polymorphic markers obtained by 2b-RAD sequencing of a C. farreri mapping population.

| Marker type | Segregation pattern | Total marker no.^a | Marker no. in accord with Mendelian segregation^b | Mapped marker no. |
|---|---|---|---|---|
| Codominant marker | (AA×aa) or (aa×AA) | 203 | n.a. | n.a. |
| | (Aa×aa) or (aa×Aa) | 1882 | 1432 | 1166 |
| | (Aa×Aa) | 314 | 233 | 187 |
| Dominant marker | (AA×-) or (-×AA) | 413 | n.a. | n.a. |
| | (A-×-) or (-×A-) | n.a. | 3216 | 2453 |
| | (A-×A-)^c | n.a. | 1430 | n.a. |

^a Total marker no. refers to all polymorphic markers reported by RADtyping regardless of whether they follow Mendelian segregation in progeny.

^b For dominant markers, only those in accord with Mendelian segregation were scored to ensure the correct assignment of markers to different segregation patterns.

^c This segregation type was scored separately from the main pipeline.

doi:10.1371/journal.pone.0079960.t001
Table 2. Sanger validation of 2b-RAD genotypes.

| Marker type | Sample (depth) | Genotype class | 2b-RAD genotype | Validated by Sanger sequencing | Validation rate |
|---|---|---|---|---|---|
| Codominant | Parent (depth: 49-77x) | Heterozygote | 8 | 8 | 100% |
| | | Homozygote | 8 | 8 | 100% |
| | Progeny (depth: 14-21.2x) | Heterozygote | 12 | 10 | 84% |
| | | Homozygote | 20 | 20 | 100% |
| | Total | | 48 | 46 | 96% |
| Dominant | Parent (depth: 37-63x) | Presence | 8 | 8 | 100% |
| | | Absence | 8 | 8 | 100% |
| | Progeny (depth: 13.7-22x) | Presence | 13 | 12 | 92% |
| | | Absence | 8 | 8 | 100% |
| | Total | | 37 | 36 | 97% |

doi:10.1371/journal.pone.0079960.t002



species Arabidopsis thaliana. Approximately 1% of the BsaXI sites in
the Arabidopsis genome were randomly chosen as polymorphic
sites. For each polymorphic locus, a parental genotype was
designated as either homozygote or heterozygote at a rate of 50%,
while for progeny, genotypes were randomly generated by
conforming to the law of independent recombination. In silico
sequencing was performed for the pseudo Arabidopsis mapping
population. Different sequencing depths were evaluated for the
parents (10x to 50x) and the progeny (5x to 30x). Each allele
was "sequenced" to a depth determined by a draw from a Poisson
distribution. For each "sequenced" read, the global error rate,
which increased linearly along the sequence, was set to 1%.
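
The read-sampling step can be sketched as follows. This is a minimal illustration, not the simulation code used in the study: all function names are hypothetical, the Poisson sampler is Knuth's multiplication method, and the linear error profile (per-base rates that rise from the first to the last position and average 1%) is one assumed reading of the error model described above.

```python
import math
import random


def poisson_draw(mean, rng):
    """Draw one Poisson variate (Knuth's method; fine for means of a few tens)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def linear_error_profile(read_len, mean_rate=0.01):
    """Per-base error rates rising linearly along the read and averaging mean_rate.

    Assumed form: the source only states that the error rate increases linearly.
    """
    return [2.0 * mean_rate * (pos + 1) / (read_len + 1) for pos in range(read_len)]


def simulate_allele_reads(allele_seq, mean_depth, rng, mean_rate=0.01):
    """'Sequence' one allele: Poisson-distributed depth, substitution errors per base."""
    depth = poisson_draw(mean_depth, rng)
    profile = linear_error_profile(len(allele_seq), mean_rate)
    reads = []
    for _ in range(depth):
        bases = []
        for base, err in zip(allele_seq, profile):
            if rng.random() < err:
                # Replace the true base with one of the three other nucleotides.
                base = rng.choice([b for b in "ACGT" if b != base])
            bases.append(base)
        reads.append("".join(bases))
    return reads


rng = random.Random(2013)
reads = simulate_allele_reads("ACGTACGTACGTACGTACGTACGTACG", mean_depth=20, rng=rng)
print(len(reads), "reads simulated for this allele")
```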

Two real sequencing datasets were utilized in this study. The
first dataset was retrieved from our recent linkage mapping study
for C. farreri [12], which was generated by 2b-RAD sequencing of
two parents and 96 F1 progeny. Briefly, 2b-RAD libraries were
prepared by following the protocol developed by Wang et al. [2].
For the parents, standard BsaXI libraries were constructed, while
for the progeny, reduced representation (RR) libraries were
constructed using adaptors with 5'-NNT-3' overhangs to target



Table 3. Codominant and dominant SNPs confirmed by Sanger-based amplicon sequencing.

| Marker type | Marker | BsaXI tags^a | Forward primer (5'→3') | Reverse primer (5'→3') |
|---|---|---|---|---|
| Codominant | m119628 | TGGTAGGAAACTTTTTCTCCTCGT(C/T)CC | GCAGAGTTGGCAAAGGGG | ACACGGCCAGAACCCAGC |
| | f83678 | AAACAGTTTACATGGACTCCCC(T/C)AACA | ACTGCTCCCACCTCTGAC | GTAGTCCCAGTTGCTCCA |
| | m12011 | TCAAAGATAACCCTATCTCC(G/A)CTATAG | TTCTGCTTGTCCACACACGACCTCC | ACTGCTGCTGTTTCTTACACTTATG |
| | f47186 | CATGG(G/C)GTCACTTGATCTCCCGACAGA | CCCCTTACCTTCACTGT | TGTGACAACACTGACTCG |
| | f79797 | TAACCGATGACGAGTACTCCGAAGT(G/A)T | GGTCTGGTACAAACAAATGAC | AGACAGACTGCTTTGCCA |
| | m81459 | GCTAACGCCACAAAAACTCCC(C/A)GAGAG | GAAGTTCAAAAGGGAGTA | GAGCAATGTTAGGGCTAA |
| | m386 | ACCT(T/G)CGAAACTAATTCTCCGAAATGT | ATCAAGCGTCAATATAACCTG | AGAAGCACAACACTGCTGTAC |
| | f12046 | TA(T/A)ATAGTTACTGATCCTCCAAGATTT | TTAGGTGTAAGTAAGGAC | TTTAGTCGGCTAGTATTG |
| Dominant | df33179 | ACCTGCTTCACAGAAGCT(C/G)CTTCGAAT | TCTACCGACCGACGGACTGA | ACTAGTTCCCTGTTCTTTTACTGAT |
| | dm25086 | CATTCCACCACCCCACC(T/G)CCCACCCAA | GATAAACGACTGAGTGGAAC | GGTGCGCTAATGGAAATA |
| | df29520 | CGTTGCAGAACTCAGGC(T/A)CCGCCCCTC | TAACGTAGCGACATCAGG | ATTGAGTTCAGGAGTTTCC |
| | dm27070 | CACAAACACACATTAAC(T/C)CCTGACATT | AACTAAAGCTACCCAGACAC | GACGCTAGATGGATGACA |
| | df4428 | CTGATAAGGACCGCTGCT(C/T)CCCCTCTC | ATCATTACAGTAACTTCCACTCGGT | ACGGCTGACTACCTGTAAACATTGA |
| | df12778 | TTCATTTGAA(C/T)TCTCCCTCCTTTAATG | ATTACACCTGCATGAACAA | TAATGAACTGTGGGACGC |
| | df9608 | TCTACGTATA(C/T)ATTTCCTCCCACTCCA | CTGATGGCAAGTTGTATCCAGAATG | CATAATATAAGACCAAATCATCACA |
| | dm25622 | CATTGAGCT(A/T)CCCAGTCTCCAGACCTC | CTTATGCTTACAAAGGAGGT | ATCTAAGTTGTTGGGCAGT |

^a BsaXI restriction sites are highlighted in bold and SNP alleles are indicated in parentheses.
doi:10.1371/journal.pone.0079960.t003
Table 4. Consistency of codominant genotyping on replicate 2b-RAD libraries prepared from two parents and ten progeny.

Columns give the genotype called from Replicate 1; rows give the outcome in Replicate 2.

| Genotyped from Replicate 2 | Homozygous (Parent) | Heterozygous (Parent) | Homozygous (Progeny) | Heterozygous (Progeny) |
|---|---|---|---|---|
| Same genotype | 1,527 | 1,578 | 6,813 | 5,307 |
| Different, homozygous | 0 | 8 | 0 | 401 |
| Different, heterozygous | 5 | 4 | 150 | 13 |
| Agreement (%) | 99.7% | 99.2% | 98.1% | 92.8% |

Note: average sequencing depths for the two parents were 181x and 185x in rep1 and 190x and 235x in rep2, while for progeny they were 37-46x in rep1 and 22-30x
in rep2.

doi:10.1371/journal.pone.0079960.t004



a subset of BsaXI fragments in the genome. 2b-RAD libraries were
subjected to single-end sequencing (1×50 bp) using an Illumina GA-II
sequencer. The second dataset was retrieved from our ongoing
linkage mapping project for Argopecten irradians irradians and
Argopecten purpuratus, which was generated by 2b-RAD sequencing
of an F1 hybrid family created by crossing A. irradians (♀) and A.
purpuratus (♂). Similar to the first dataset, standard libraries were
constructed for parents and RR libraries were constructed for
progeny using adaptors with 5'-NNA-3' and 5'-NNT-3' overhangs.
Replicate libraries were independently constructed for two
parents and ten progeny, and then sequenced (1×36 bp) in two
separate sequencing runs on an Illumina HiSeq2000 sequencer.
All of the 2b-RAD sequences were archived in the SRA database
under accession numbers SRA065207 (first dataset) and
SRP029614 (second dataset).

RADtyping methodology 

RADtyping is a pipeline program that integrates all custom Perl 
scripts necessary for implementing de novo codominant and 
dominant genotyping algorithms. RADtyping can deal with both 
single-end and paired-end RAD sequencing data. The principle of 
its genotyping strategy is elaborated as follows. 

Paternal and maternal reads are first pooled together to 
assemble into exactly matching read clusters (i.e. representing
individual alleles), and then "allele" clusters are further merged 
into "locus" clusters by allowing two mismatches using the Ustacks 
program (parameters -m 3, -M 2; [4]). A collection of consensus 
sequences from all of the "locus" clusters comprises a set of 
representative reference sites that are further classified into parent- 
shared and parent-specific sites. 
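
The two-step clustering idea can be sketched as follows. This is a simplified greedy illustration, not the Ustacks algorithm itself: the function names are hypothetical, the deepest allele stands in for the locus consensus, and the defaults simply mirror the quoted parameters (minimum depth 3, at most two mismatches).

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def build_reference_sites(reads, min_depth=3, max_mismatch=2):
    """Cluster pooled parental reads into 'allele' and then 'locus' clusters."""
    # Step 1: exactly matching reads form "allele" clusters; shallow ones are dropped.
    allele_depth = Counter(reads)
    alleles = [(seq, n) for seq, n in allele_depth.most_common() if n >= min_depth]

    # Step 2: merge each allele into the first locus whose representative
    # sequence differs by at most max_mismatch bases (greedy, deepest first).
    loci = []
    for seq, n in alleles:
        for locus in loci:
            same_len = len(seq) == len(locus["consensus"])
            if same_len and hamming(seq, locus["consensus"]) <= max_mismatch:
                locus["alleles"].append(seq)
                locus["depth"] += n
                break
        else:
            loci.append({"consensus": seq, "alleles": [seq], "depth": n})
    return loci


pooled = ["AAAAAAAA"] * 5 + ["AAAAAAAT"] * 4 + ["CCCCCCCC"] * 3 + ["GGGGGGGG"]
for locus in build_reference_sites(pooled):
    print(locus["consensus"], locus["depth"], locus["alleles"])
```

In this toy input the two A-rich alleles merge into one locus of depth 9, the C-rich allele forms its own locus, and the singleton read is discarded as too shallow.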

For the parent-shared sites, cluster depth ($d$) approximately
follows a mixed Poisson distribution due to the existence of
composite clusters:

$$\Pr(d \mid C) \sim \sum_{1 \le i \le M} a_i \, \mathrm{poisson}(d \mid iC) \tag{1}$$

where $\sum_{1 \le i \le M} a_i = 1$ and $M$ represents the copy number for
repetitive sites. The parameters $C$ and $a_i$ can be estimated from
the sequencing data using the expectation-maximization (EM)
algorithm. To remove low-quality sites, reference sites are filtered
to retain those supported by parental reads in sufficient depth (i.e.,
the requirements $d_{p1} > l_{p1}$ and $d_{p2} > l_{p2}$). The thresholds $l_{p1}$ and $l_{p2}$
are determined by:

$$l_{pj} = \max\Big\{ d \;\Big|\; \sum_{0 \le m \le d} \mathrm{poisson}(m \mid C_j) < 0.05 \Big\} \tag{2}$$

where $C_j$ is the mean sequencing depth of the $j$th parent ($j = 1, 2$).
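
Formula (2) can be evaluated with a short routine such as the one below. It is a sketch under our own naming (`poisson_pmf` and `depth_threshold` are hypothetical helpers, not RADtyping functions) and it returns -1 when even a depth of zero already has a cumulative probability of 0.05 or more.

```python
import math


def poisson_pmf(m, mean):
    """Poisson probability of observing exactly m reads given the mean depth."""
    return math.exp(-mean + m * math.log(mean) - math.lgamma(m + 1))


def depth_threshold(mean_depth, tail=0.05):
    """Largest d whose cumulative Poisson probability P(X <= d) stays below tail."""
    if mean_depth <= 0:
        raise ValueError("mean_depth must be positive")
    cumulative = 0.0
    d = -1  # no depth satisfies the condition yet
    while True:
        cumulative += poisson_pmf(d + 1, mean_depth)
        if cumulative >= tail:
            return d
        d += 1


print(depth_threshold(20))  # prints 12
```

With a mean parental depth of 20x the threshold is 12, so a site would need a depth of at least 13 reads in that parent to be retained.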
To remove repetitive sites, parent-shared sites are filtered by
excluding those with depths larger than $L$. The threshold $L$ is
determined by:

$$L = \max\big\{ d \;\big|\; a_1 \, \mathrm{poisson}(d \mid C_1 + C_2) > a_i \, \mathrm{poisson}(d \mid i(C_1 + C_2)) \big\} \tag{3}$$
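
A possible numerical reading of this cutoff is sketched below. It assumes that L is the largest depth at which the single-copy component still outweighs the i-copy component, which matches the rule of excluding sites with depths larger than L; the function names, the choice of two copies, and the mixture weights in the example are illustrative assumptions rather than values from the study.

```python
import math


def log_poisson(d, mean):
    """Natural log of the Poisson probability of observing depth d."""
    return -mean + d * math.log(mean) - math.lgamma(d + 1)


def repeat_depth_cutoff(c1, c2, a1, a_i, copies=2, max_depth=100000):
    """Largest depth at which the single-copy term exceeds the multi-copy term.

    c1, c2 : mean sequencing depths of the two parents
    a1, a_i: mixture weights of the single-copy and i-copy components
    Returns -1 if the single-copy term never dominates.
    """
    combined = c1 + c2
    cutoff = -1
    for d in range(max_depth + 1):
        single = math.log(a1) + log_poisson(d, combined)
        multi = math.log(a_i) + log_poisson(d, copies * combined)
        if single > multi:
            cutoff = d
        else:
            break  # the comparison flips only once as depth grows
    return cutoff


# Hypothetical example: both parents at 20x, 90% single-copy and 10% two-copy sites.
print(repeat_depth_cutoff(20, 20, a1=0.9, a_i=0.1))  # prints 60
```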



For the parent-specific sites, low-quality sites (i.e., $d_{p1} < l_{p1}$ or
$d_{p2} < l_{p2}$) are also removed. In addition, to avoid incorrect
"absence" calls from low-coverage sites in the progeny, the
reference sites are further filtered to remove those with $d_{po}$ less
than $l_{po}$, where $d_{po}$ is calculated for each site by summing over all



Table 5. Consistency of dominant genotyping on replicate 2b-RAD libraries prepared from two parents and ten progeny.

Columns give the genotype called from Replicate 1; rows give the outcome in Replicate 2.

| Genotyped from Replicate 2 | Absent (Parent) | Present (Parent) | Absent (Progeny) | Present (Progeny) |
|---|---|---|---|---|
| Same genotype | 3,972 | 3,972 | 14,133 | 12,915 |
| Different, absent | | 0 | | 316 |
| Different, present | 0 | | 112 | |
| Agreement (%) | 100% | 100% | 99.2% | 97.6% |

Note: average sequencing depths for the two parents were 181x and 185x in rep1 and 190x and 235x in rep2, while for progeny they were 37-46x in rep1 and 22-30x
in rep2.

doi:10.1371/journal.pone.0079960.t005
progeny having reads derived from that site, and the threshold $l_{po}$
is determined for each site using formula (2).

Once the high-quality reference sites are obtained, sequencing 
data from the parents and progeny are separately mapped against 
these sites using SOAP2 software (parameters -M 4, -v 2; [13]). 
For codominant genotyping, posterior probability is calculated for
two possible genotypes (i.e., homozygote or heterozygote) at a given
locus using a maximum likelihood approach [14]:

$$L_1 = \Pr(n_1, n_2, n_3, n_4 \mid \text{homozygote}) = \frac{n!}{n_1!\, n_2!\, n_3!\, n_4!} \left(1 - \frac{3\varepsilon}{4}\right)^{n_1} \left(\frac{\varepsilon}{4}\right)^{n_2 + n_3 + n_4}$$

$$L_2 = \Pr(n_1, n_2, n_3, n_4 \mid \text{heterozygote}) = \frac{n!}{n_1!\, n_2!\, n_3!\, n_4!} \left(0.5 - \frac{\varepsilon}{4}\right)^{n_1 + n_2} \left(\frac{\varepsilon}{4}\right)^{n_3 + n_4} \tag{4}$$

where $n_1$, $n_2$, $n_3$ and $n_4$ are the read counts for each of the four
possible nucleotides (A, T, C and G), $n$ is the total number of reads
and $\varepsilon$ is the sequencing error rate. The genotype is assigned based
on the result of a likelihood ratio test (LRT) between the two most
likely hypotheses with one degree of freedom. If the test is significant
at a level of $\alpha = 0.05$, the most likely genotype is assigned at the given
locus; otherwise, the genotype is left uncalled.
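
The likelihood ratio test can be sketched in a few lines. This is an illustrative implementation, not the RADtyping code: `call_codominant` is a hypothetical name, the counts are sorted so that the two most frequent nucleotides play the roles of n1 and n2 (an assumption on our part), the default error rate of 0.01 is arbitrary, and 3.841 is the chi-square critical value for one degree of freedom at a significance level of 0.05. The multinomial coefficient is identical under both hypotheses and cancels in the ratio, so it is omitted.

```python
import math

CHI2_CRITICAL_1DF = 3.841  # chi-square critical value, 1 d.f., alpha = 0.05


def call_codominant(counts, error_rate=0.01):
    """Call 'homozygote', 'heterozygote' or 'uncalled' from four nucleotide counts."""
    # Assumption: n1 and n2 are the two most frequent nucleotides at the site.
    n1, n2, n3, n4 = sorted(counts, reverse=True)
    e = error_rate
    log_hom = n1 * math.log(1 - 3 * e / 4) + (n2 + n3 + n4) * math.log(e / 4)
    log_het = (n1 + n2) * math.log(0.5 - e / 4) + (n3 + n4) * math.log(e / 4)
    lrt = 2 * abs(log_hom - log_het)
    if lrt < CHI2_CRITICAL_1DF:
        return "uncalled"
    return "homozygote" if log_hom > log_het else "heterozygote"


# Counts are given in the order (A, T, C, G).
print(call_codominant((20, 0, 0, 0)))   # prints "homozygote"
print(call_codominant((0, 10, 0, 10)))  # prints "heterozygote"
print(call_codominant((5, 1, 0, 0)))    # prints "uncalled"
```

The last case shows why low-depth sites tend to remain uncalled: with only six reads, a single discordant read cannot be confidently attributed to either a sequencing error or a second allele.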

For dominant genotyping, supposing that the cluster depth of
the $i$th site for the $j$th progeny is $d_{ij}$, this site is genotyped as
"presence" if $d_{ij} > l_j$, "absence" if $d_{ij} = 0$, and "unknown" if $d_{ij} \in (0, l_j]$,
where the threshold $l_j$ is determined using formula (2) with $C$
representing the mean sequencing depth of the $i$th site.
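
The three-way decision rule translates directly into code. In the sketch below, `call_dominant` is a hypothetical helper and the threshold of 12 used in the example is an arbitrary illustrative value; in practice the threshold would come from formula (2).

```python
def call_dominant(depth, threshold):
    """Score one site in one progeny as 'presence', 'absence' or 'unknown'."""
    if depth < 0:
        raise ValueError("depth cannot be negative")
    if depth == 0:
        return "absence"   # no reads mapped to the site
    if depth > threshold:
        return "presence"  # depth clearly above the threshold
    return "unknown"       # 0 < depth <= threshold: too few reads to decide


# Depths of one site across five progeny, scored with an illustrative threshold of 12.
depths = [0, 3, 12, 13, 25]
print([call_dominant(d, threshold=12) for d in depths])
# prints ['absence', 'unknown', 'unknown', 'presence', 'presence']
```

Keeping the intermediate class separate avoids calling "presence" from a handful of possibly misaligned reads.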

Genotype validation by Sanger sequencing 

To verify the genotypes obtained from the first 2b-RAD
sequencing dataset, eight codominant and eight dominant markers
were randomly selected for Sanger sequencing. The selected
marker sequences were mapped to the aforementioned C. farreri
reference genome to retrieve flanking sequences for primer design.
Primers were designed to amplify a fragment (150-300 bp)
flanking each target site (primer sequences are provided in
Table 3). Each PCR amplification was performed in a 20-µl
volume composed of approximately 20 ng genomic DNA, 0.2 µM
of each primer, 200 µM of each dNTP, 1.5 mM MgCl2, 1 U of
Taq DNA polymerase (Takara) and 1x PCR buffer. All cycling
programs began with an initial denaturation at 95°C for 5 min,
followed by 26-30 cycles of 95°C for 30 s, 60°C for 30 s, 72°C for
30 s and a final extension at 72°C for 5 min. Each PCR product
was run on a 1.5% agarose gel to determine the success of the
PCR. PCR products amplified from two parents and four progeny
were purified using the QIAquick PCR purification kit (Qiagen),
and then were sequenced using the Sanger method.

Supporting Information 

Table S1. Evaluation of the de novo RADtyping approach
using a pseudo F1 mapping population. The
simulated population was created by crossing two Arabidopsis plants
with predefined SNPs in their genomes, and was subject to in silico
sequencing together with their parents at different sequencing
depths with sequencing errors enabled. De novo codominant and
dominant genotyping was evaluated in three key aspects: genotype
coverage (a, d), removal of repetitive sites (b, e), and genotyping
accuracy (c, f).
(PDF) 